defense method
the Fine tuning Process of on Poisoned
In this section, we show our empirical observations obtained from fine-tuning PLMs on poisoned494 datasets. Specifically, we demonstrate that the backdoor triggers are easier to learn from the lower495 layers than the features corresponding to the main task. This observation plays a pivotal role in496 designing and understanding our defense algorithm. In our experiment, we focus on the SST-2497 dataset [30] and consider the widely adopted word-level backdoor trigger and the more stealthy498 style-level trigger. For the word-level trigger, we follow the approach in prior work [25] and adopt the499 meaningless word "bb" as the trigger to minimize its impact on the original text's semantic meaning.500
Defending against Data-Free Model Extraction by Distributionally Robust Defensive Training
Data-Free Model Extraction (DFME) aims to clone a black-box model without knowing its original training data distribution, making it much easier for attackers to steal commercial models. Defense against DFME faces several challenges: (i) effectiveness; (ii) efficiency; (iii) no prior on the attacker's query data distribution and strategy. However, existing defense methods: (1) are highly computation and memory inefficient; or (2) need strong assumptions about attack data distribution; or (3) can only delay the attack or prove a model theft after the model stealing has happened. In this work, we propose a Memory and Computation efficient defense approach, named MeCo, to prevent DFME from happening while maintaining the model utility simultaneously by distributionally robust defensive training on the target victim model. Specifically, we randomize the input so that it: (1) causes a mismatch of the knowledge distillation loss for attackers; (2) disturbs the zerothorder gradient estimation; (3) changes the label prediction for the attack query data. Therefore, the attacker can only extract misleading information from the black-box model. Extensive experiments on defending against both decision-based and scorebased DFME demonstrate that MeCo can significantly reduce the effectiveness of existing DFME methods and substantially improve running efficiency.
Constructing Unrestricted Adversarial Examples with Generative Models
Adversarial examples are typically constructed by perturbing an existing data point within a small matrix norm, and current defense methods are focused on guarding against this type of attack. In this paper, we propose a new class of adversarial examples that are synthesized entirely from scratch using a conditional generative model, without being restricted to norm-bounded perturbations. We first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over data samples. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier. We demonstrate through human evaluation that these new kind of adversarial images, which we call Generative Adversarial Examples, are legitimate and belong to the desired class. Our empirical results on the MNIST, SVHN, and CelebA datasets show that generative adversarial examples can bypass strong adversarial training and certified defense methods designed for traditional adversarial attacks.
Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
However, Does achieving a low ASR through current safety purification methods truly eliminate learned backdoor features from the pretraining phase? In this paper, we provide an affirmative answer to this question by thoroughly investigating the Post-Purification Robustness of current backdoor purification methods.